geometry problem
Causal-R: ACausal-Reasoning Geometry Problem Solver for Optimized Solution Exploration
The task of geometry problem solving has been a long-standing focus in the automated mathematics community and is drawing growing attention due to its complexity for both symbolic and neural models. Although prior studies have explored various effective approaches for enhancing problem solving performances, two fundamental challenges remain unaddressed, which are essential to the application in practical scenarios. First, the multi-step reasoning gap between the initial geometric conditions and ultimate problem goal leads to a large search space for solution exploration. Second, obtaining multiple interpretable and shorter solutions remains an open problem. In this work, we introduce the Causal-Reasoning Geometry Problem Solver to overcome these challenges.
SOLIDGEO: Measuring Multimodal Spatial Math Reasoning in Solid Geometry
Geometry is a fundamental branch of mathematics and plays a crucial role in evaluating the reasoning capabilities of multimodal large language models (MLLMs). However, existing multimodal mathematics benchmarks mainly focus on plane geometry and largely ignore solid geometry, which requires spatial reasoning and is more challenging than plane geometry. To address this critical gap, we introduce SOLIDGEO, the first large-scale benchmark specifically designed to evaluate the performance of MLLMs on mathematical reasoning tasks in solid geometry.
Gold-Medal-Level Olympiad Geometry Solving with Efficient Heuristic Auxiliary Constructions
Duan, Boyan, Liang, Xiao, Lu, Shuai, Wang, Yaoxiang, Shen, Yelong, Chang, Kai-Wei, Wu, Ying Nian, Yang, Mao, Chen, Weizhu, Gong, Yeyun
Automated theorem proving in Euclidean geometry, particularly for International Mathematical Olympiad (IMO) level problems, remains a major challenge and an important research focus in Artificial Intelligence. In this paper, we present a highly efficient method for geometry theorem proving that runs entirely on CPUs without relying on neural network-based inference. Our initial study shows that a simple random strategy for adding auxiliary points can achieve silver-medal level human performance on IMO. Building on this, we propose HAGeo, a Heuristic-based method for adding Auxiliary constructions in Geometric deduction that solves 28 of 30 problems on the IMO-30 benchmark, achieving gold-medal level performance and surpassing AlphaGeometry, a competitive neural network-based approach, by a notable margin. To evaluate our method and existing approaches more comprehensively, we further construct HAGeo-409, a benchmark consisting of 409 geometry problems with human-assessed difficulty levels. Compared with the widely used IMO-30, our benchmark poses greater challenges and provides a more precise evaluation, setting a higher bar for geometry theorem proving.
Euclid's Gift: Enhancing Spatial Perception and Reasoning in Vision-Language Models via Geometric Surrogate Tasks
Lian, Shijie, Wu, Changti, Yang, Laurence Tianruo, Yuan, Hang, Yu, Bin, Zhang, Lei, Chen, Kai
Spatial intelligence spans a rich suite of abilities, including visualising and transforming shapes, mentally rotating objects, judging relational positions and containment, and estimating numerosity. However, it still remains a critical unresolved challenge for Multimodal Large Language Models (MLLMs). To fill this gap, we propose to treat Euclidean geometry problem-solving as a surrogate task. Specifically, we meticulously constructed a curated multimodal dataset, called Euclid30K, comprising approximately 30K plane and solid geometry problems. Furthermore, to enable the model to learn and apply Euclidean principles from these geometry problems, we fine-tuned seven model variants (spanning 3--72B parameters) from the Qwen2.5VL, Qwen3VL, and RoboBrain2.0 families using Group Relative Policy Optimization (GRPO), inspiring the models to identify shapes, count, and relate entities, and perform multi-step deductive reasoning using Euclidean principles. Our experiments demonstrate that the resulting models achieve substantial zero-shot gains across four spatial reasoning benchmarks (Super-CLEVR, Omni3DBench, VSI-Bench, and MindCube) without any task-specific adaptations. Notably, after training on the Euclid30K, the mean VSI-Bench accuracy rose from 36.6\% to 41.8\% (+5.2\%), and the mean MindCube accuracy rose from 31.4\% to 38.1\% (+6.7\%). To our knowledge, this is the first systematic study showing that geometry-centric fine-tuning can confer vision-language models with broadly transferable spatial skills. Code and Euclid30K dataset can be found in \href{https://zgca-ai4edu.github.io/Euclids_Gift}{this}.
An Ontology-Based Approach to Optimizing Geometry Problem Sets for Skill Development
Bouzinier, Michael, Trifonov, Sergey, Chen, Matthew, Venkatesh, Tarun, Rifkin, Lielle
Euclidean geometry has historically played a central role in cultivating logical reasoning and abstract thinking within mathematics education, but has experienced waning emphasis in recent curricula. The resurgence of interest, driven by advances in artificial intelligence and educational technology, has highlighted geometry's potential to develop essential cognitive skills and inspired new approaches to automated problem solving and proof verification. This article presents an ontology-based framework for annotating and optimizing geometry problem sets, originally developed in the 1990s. The ontology systematically classifies geometric problems, solutions, and associated skills into interlinked facts, objects, and methods, supporting granular tracking of student abilities and facilitating curriculum design. The core concept of 'solution graphs'--directed acyclic graphs encoding multiple solution pathways and skill dependencies--enables alignment of problem selection with instructional objectives. We hypothesize that this framework also points toward automated solution validation via semantic parsing. We contend that our approach addresses longstanding challenges in representing dynamic, procedurally complex mathematical knowledge, paving the way for adaptive, feedback-rich educational tools. Our methodology offers a scalable, adaptable foundation for future advances in intelligent geometry education and automated reasoning.
GeoFM: Enhancing Geometric Reasoning of MLLMs via Synthetic Data Generation through Formal Language
Zhang, Yuhao, Hu, Dingxin, Yu, Tinghao, Liu, Hao, Liu, Yiting
Multi-modal Large Language Models (MLLMs) have gained significant attention in both academia and industry for their capabilities in handling multi-modal tasks. However, these models face challenges in mathematical geometric reasoning due to the scarcity of high-quality geometric data. To address this issue, synthetic geometric data has become an essential strategy. Current methods for generating synthetic geometric data involve rephrasing or expanding existing problems and utilizing predefined rules and templates to create geometric images and problems. However, these approaches often produce data that lacks diversity or is prone to noise. Additionally, the geometric images synthesized by existing methods tend to exhibit limited variation and deviate significantly from authentic geometric diagrams. To overcome these limitations, we propose GeoFM, a novel method for synthesizing geometric data. GeoFM uses formal languages to explore combinations of conditions within metric space, generating high-fidelity geometric problems that differ from the originals while ensuring correctness through a symbolic engine. Experimental results show that our synthetic data significantly outperforms existing methods. The model trained with our data surpass the proprietary GPT-4o model by 18.7\% on geometry problem-solving tasks in MathVista and by 16.5\% on GeoQA. Additionally, it exceeds the performance of a leading open-source model by 5.7\% on MathVista and by 2.7\% on GeoQA.
What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
Chung, Jiwan, Joshi, Neel, Sharma, Pratyusha, Yu, Youngjae, Vineet, Vibhav
Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
GeoSketch: A Neural-Symbolic Approach to Geometric Multimodal Reasoning with Auxiliary Line Construction and Affine Transformation
Weng, Shichao, Wang, Zhiqiang, Zhou, Yuhua, Lu, Rui, Liu, Ting, Teng, Zhiyang, Liu, Xiaozhang, Liu, Hanmeng
Geometric Problem Solving (GPS) poses a unique challenge for Multimodal Large Language Models (MLLMs), requiring not only the joint interpretation of text and diagrams but also iterative visuospatial reasoning. While existing approaches process diagrams as static images, they lack the capacity for dynamic manipulation--a core aspect of human geometric reasoning involving auxiliary line construction and affine transformations. GeoSketch integrates: (1) a Perception module that abstracts diagrams into structured logic forms, (2) a Symbolic Reasoning module that applies geometric theorems to decide the next deductive step, and (3) a Sketch Action module that executes operations such as drawing auxiliary lines or applying transformations, thereby updating the diagram in a closed loop. To train this agent, we develop a two-stage pipeline: supervised fine-tuning on 2,000 symbolic-curated trajectories followed by reinforcement learning with dense, symbolic rewards to enhance robustness and strategic exploration. To evaluate this paradigm, we introduce the GeoSketch Benchmark, a high-quality set of 390 geometry problems requiring auxiliary construction or affine transformations. Experiments on strong MLLM baselines demonstrate that GeoSketch significantly improves stepwise reasoning accuracy and problem-solving success over static perception methods. Work done during an internship at Hainan University. With the advent of Multimodal Large Language Models (MLLMs) (OpenAI, 2024; Comanici et al., 2025; Hong et al., 2025), Geometric Problem Solving (GPS) presents a unique challenge to MLLMs, demanding not only the joint understanding of text and diagrams but also rigorous, multi-step deductive reasoning (Zhang et al., 2023; Qiao et al., 2024; He et al., 2025). While modern MLLMs can ingest multimodal inputs, their reasoning output remains confined to static text. This limits the use of dynamic and visuospatial strategies in geometric problem solving, which becomes particularly evident in complex tasks requiring multi-stage manipulation.
A Survey of Deep Learning for Geometry Problem Solving
Ma, Jianzhe, Wang, Wenxuan, Jin, Qin
Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
LeanGeo: Formalizing Competitional Geometry problems in Lean
Song, Chendong, Wang, Zihan, Pu, Frederick, Wang, Haiming, Lin, Xiaohan, Liu, Junqi, Li, Jia, Liu, Zhengying
Geometry problems are a crucial testbed for AI reasoning capabilities. Most existing geometry solving systems cannot express problems within a unified framework, thus are difficult to integrate with other mathematical fields. Besides, since most geometric proofs rely on intuitive diagrams, verifying geometry problems is particularly challenging. To address these gaps, we introduce LeanGeo, a unified formal system for formalizing and solving competition-level geometry problems within the Lean 4 theorem prover. LeanGeo features a comprehensive library of high-level geometric theorems with Lean's founda-tional logic, enabling rigorous proof verification and seamless integration with Mathlib. We also present LeanGeo-Bench, a formal geometry benchmark in LeanGeo, comprising problems from the International Mathematical Olympiad (IMO) and other advanced sources. Our evaluation demonstrates the capabilities and limitations of state-of-the-art Large Language Models on this benchmark, highlighting the need for further advancements in automated geometric reasoning. In recent years, Large Language Models (LLMs) have made significant progress in mathematical reasoning, particularly in automated theorem proving (Bibel, 2013). Formal theorem proving is a crucial domain for ensuring the correctness of hard-to-verify proofs within theorem proving. Lean 4 (Moura & Ullrich, 2021), as a prominent proof assistant, provides a solid foundation for algebra and number theory through its extensive Mathlib library (mathlib community, 2020). It has been widely used in the formal verification of theorems within LLMs. However, Euclidean geometry, an essential component of mathematical reasoning and a frequent focus of competitions, remains relatively underexplored in Lean 4 community, Mathlib and automated theorem provers.